Data Cleaning Protocol

The aim of this document is to derive and present a method of multiple steps to be taken that convert the raw personal environmental data into an analysis ready dataset. For this I selected the data of 3 participants: ACT001D (very good data from visual inspection), ACT014F (some poor data), ACT032V (very poor data). The variables to be cleaned are temperature, relative humidity RH, and noise.

Main reasons for cleaning are:

The steps are the following with their relevance indicated with * (in my opinion):

  1. Study design limitations (extracted from PVL) *** (done)
  2. Physically possible (temperature > 0K, noise > 0 dB, RH < 100% etc). ** (done)
  3. Physically plausible (temperature < -10 °C etc, taped temperature < 20 °C) *** (discuss)
  4. Variability (variable specific threshold of multiple sigma from mean) *** (tune)
  5. Sequence (large jumps and repeating values) * (tune)
  6. Comparison of the same variables from different devices (taped T == house T) ** (discuss)

1. Study Design: Observation period and excluding PVL-visits

First, data has to be excluded that was taken outside the observation window and during personal visit log times where the device was potentially taken down and charged. Both was done while compiling all the data to one data set and can be reviewed in the Week1_Data_Loop.R file. Below you can see the data from the 3 individuals with gaps where the pvls took place.


2. Physically possible

Every Variable (temperature, RH, noise) has its physical limits that the following:

  1. Temperature: < -273 °C
  2. RH: < 0 % and > 100 %
  3. Noise: < 0 dB
# House
data_H <- data_H |>
  mutate(IBH_TEMP_01 = if_else(IBH_TEMP < -273, 1, 0),
         IBH_HUM_01 = if_else(IBH_HUM < 0 | IBH_HUM > 100, 1, 0))

# Worn
data_W <- data_W |>
  mutate(IBW_TEMP_01 = if_else(IBW_TEMP < -273, 1, 0),
         IBW_HUM_01 = if_else(IBW_HUM < 0 | IBW_HUM > 100, 1, 0))

# Taped
data_T <- data_T |>
  mutate(IBT_TEMP_01 = if_else(IBT_TEMP < -273, 1, 0))

# Noise
data_N <- data_N |>
  mutate(NS_01 = if_else(NS < 0, 1, 0))

No plots are shown here because there are no impossible values in the example data.


3. Physically plausible

The plausible range is to some degree subjective, depends on the observation surroundings and changes not only depending on the variable, but also what the variable describes (temperature taped and house). Therefore now we need to start with device specific variable value ranges.

  1. House: Temperature < 0 °C and > 55 °C, RH: no additional filtering
  2. Worn: Temperature < 15 °C and > 45 °C, RH: no additional filtering
  3. Taped: Temperature < 33 °C and > 39 °C
  4. Noise: no additional filtering
# House
data_H <- data_H |>
  mutate(IBH_TEMP_02 = if_else(IBH_TEMP < 0 | IBH_TEMP > 55, 1, 0))

# Worn
data_W <- data_W |>
  mutate(IBW_TEMP_02 = if_else(IBW_TEMP < 15 | IBW_TEMP > 45, 1, 0))

# Taped
data_T <- data_T |>
  mutate(IBT_TEMP_02 = if_else(IBT_TEMP < 33 | IBT_TEMP > 39, 1, 0))


4. Variability

The variability between variables and devices differs significantly (eg. humidity house and worn). Because we are interested in stress experienced by the individuals, it is important to not filter out extreme but realistic conditions as these represent the largest stress impact. However we do want to filter out worn and taped measurements that resemble the variance of the house measurements and indicate the the device was not worn/taped. For the worn data, we use the moving standard deviation of 3 centered humidity values and for the taped data the same but from temperature values.

  1. Worn: threshold for humidity sd: 6
  2. Taped: threshold for temperature sd: 0.1
# Worn
data_W <- data_W |>
  mutate(IBW_TEMP_07 = if_else(IBW_HUM_MSD < 0.75, 1, 0),
         IBW_HUM_07 = if_else(IBW_HUM_MSD < 0.75, 1, 0))

# Taped
data_T <- data_T |>
  mutate(IBT_TEMP_07 = if_else(IBT_TEMP_MSD < 0.05, 1, 0))


5. Sequence

Important for the sequence analysis is to keep the measuring intervals in mind. For temperature and humidity this is 15 minutes while noise was measured every minute. Again, the feasibility of a variable changes depending on the variability itself and the object that is being observed (room and body temperature).

5.1 Large jumps in the data

For this we need to come up with thresholds that should be exceeded from one value to the other, depending on the time interval, for temperature in 15 minutes. House and worn temperature can be very different but humidity should differ as the human body acts as a source of moisture. Noise can be very sporadic and intense and will not be filtered here.

  1. House: Temperature 3 °C, Humidity 20 %
  2. Worn: Temperature 6 °C, Humidity 35 %
  3. Taped: Temperature 3 °C
# House
data_H <- data_H |>
  mutate(IBH_TEMP_03 = if_else(abs(IBH_TEMP - lag(IBH_TEMP, default = first(IBH_TEMP))) > 3, 1, 0),
         IBH_HUM_03 = if_else(abs(IBH_HUM - lag(IBH_HUM, default = first(IBH_HUM))) > 20, 1, 0))

# Worn
data_W <- data_W |>
  mutate(IBW_TEMP_03 = if_else(abs(IBW_TEMP - lag(IBW_TEMP, default = first(IBW_TEMP))) > 6, 1, 0),
         IBW_HUM_03 = if_else(abs(IBW_HUM - lag(IBW_HUM, default = first(IBW_HUM))) > 35, 1, 0))

# Taped
data_T <- data_T |>
  mutate(IBT_TEMP_03 = if_else(abs(IBT_TEMP - lag(IBT_TEMP, default = first(IBT_TEMP))) > 3, 1, 0))

5.2 No changes over time

The logic behind this filter is that in an organic environment, nothing is exactly the same over time. Everything changes always ever so slightly. The absence of change indicates a malfunction (house observations) and mismeasurements (taped). This, however, could be disregarded as the exposure should be very close to the mismeasured value. Therefore this filter will be excluded for now but thresholds for deviance from the previous values should be at least the last digit of the available digits:

  1. House: Temperature 0.0001 °C, Humidity 0.01 %
  2. Worn: Temperature 0.0001 °C, Humidity 0.01 %
  3. Taped: Temperature 0.001 °C
  4. Noise: 0.0001 dB
# WARNING Both outcomes are set to 0

# House
data_H <- data_H |>
  mutate(IBH_TEMP_04 = if_else(abs(IBH_TEMP - lag(IBH_TEMP, default = first(IBH_TEMP))) < 0.00001, 0, 0),  
         IBH_HUM_04 = if_else(abs(IBH_HUM - lag(IBH_HUM, default = first(IBH_HUM))) < 0.001, 0, 0))

# Worn
data_W <- data_W |>
  mutate(IBW_TEMP_04 = if_else(abs(IBW_TEMP - lag(IBW_TEMP, default = first(IBW_TEMP))) < 0.00001, 0, 0),
         IBW_HUM_04 = if_else(abs(IBW_HUM - lag(IBW_HUM, default = first(IBW_HUM))) < 0.001, 0, 0))

# Taped
data_T <- data_T |>
  mutate(IBT_TEMP_04 = if_else(abs(IBT_TEMP - lag(IBT_TEMP, default = first(IBT_TEMP))) < 0.00001, 0, 0)) 

# Noise 
data_N <- data_N |>
  mutate(NS_04 = ifelse(abs(NS - lag(NS, default = first(NS))) < 0.00001, 0, 0)) 

I suggest we should rather flag values that have not changed over a longer period of time like 8 intervals (2 hours). For this I utilized a backwards looking moving average.

# WARNING Both outcomes are set to 0

# House
data_H <- data_H |>
  mutate(IBH_TEMP_05 = if_else(IBH_TEMP == IBH_TEMP_MA, 0, 0),
         IBH_HUM_05 = if_else(IBH_HUM == IBH_HUM_MA, 0, 0))

# Worn
data_W <- data_W |>
  mutate(IBW_TEMP_05 = if_else(IBW_TEMP == IBW_TEMP_MA, 0, 0),
         IBW_HUM_05 = if_else(IBW_HUM == IBW_HUM_MA, 0, 0))

# Taped
data_T <- data_T |>
  mutate(IBT_TEMP_05 = if_else(IBT_TEMP == IBT_TEMP_MA, 0, 0))

# Noise
data_N <- data_N |>
  mutate(NS_05 = if_else(NS == NS_MA, 0, 0))

However, this leads to flagging of values after a slope change and consequently is not a perfect method to detect unchanged values.

6. Comparison of devices

This correction investigates whether worn or taped devices were taken off which I believe is crucial in this data cleaning. For this, we use the house device as a reference. The issue here is that the timestamps are not exactly the same but can be shifted by up to 7 minutes and 30 seconds (due to the 15 min interval). Therefore, we will compare each taped and worn measurement with the measurement taken previously by the house device. As an indicator for this we will use only temperature as house and worn humidity can likely be very similar even with proper measurements.

The worn and taped temperature have to be within +- 0.1 °C of the house temperature to be flagged. Humidity measurements are flagged if the temperature measurements are flagged as they do no represent personal exposure anymore.

data_match_HW <- fuzzy_left_join(
  data_H,
  data_W,
  by = "datetime",
  match_fun = list(`<=`)
) |> 
  group_by(datetime.x) |> 
  slice_min(datetime.y) |>  
  ungroup() |> 
  mutate(Match_TEMP_HW = ifelse(IBW_TEMP < (IBH_TEMP + 0.1) & IBW_TEMP > (IBH_TEMP - 0.1), 1, 0)) |> 
  select(uid.x, datetime.x, IBH_TEMP, IBW_TEMP, IBW_HUM, Match_TEMP_HW) 

data_match_HT <- fuzzy_left_join(
  data_H,
  data_T,
  by = "datetime",
  match_fun = list(`<=`)
) |> 
  group_by(datetime.x) |> 
  slice_min(datetime.y) |>  
  ungroup() |> 
  mutate(Match_TEMP_HT = ifelse(IBT_TEMP < (IBH_TEMP + 0.1) & IBT_TEMP > (IBH_TEMP - 0.1), 1, 0)) |> 
  select(uid.x, datetime.x, IBH_TEMP, IBT_TEMP, Match_TEMP_HT)

FINAL PRODUCT

The final cleaning is the combination of all the above filters.

Take away

The most suitable methods for filtering are:

  • PVL visits as a reference for cutting the data
  • possible values for all to exclude unsensical data
  • plausible values for taped temperature
  • variability checks for worn temperature and humidity

The following methods were less suitable because..

  • .. sequence large jumps: hard to find a common threshold without filtering “true” data
  • .. sequence no changes over time: minimum change threshold show inaccuracy but dont misrepresent exposure, moving averages flag “true” data through pure chance especially after slope inversions
  • .. comparison of devices: values can be the same and still be “true”, hard to define a range around the reference house measurements


Feedback from Apolline and Lis

  • Study design limitation: only exclude data from PVLs if the corresponding device has been charged/changed

  • Physically plausible: lower threshold for worn temperature -> below 10 °C, threshold for taped temperature IQR

  • Variability: flag only values if 3 consecutive values have exceeded the threshold

  • Sequence: not useful

  • Final Plot: keep the the original unfiltered data in the background so that it is easier to evaluate the cleaning

  • Next steps: include 5 people, do the cleaning and show the final product, then calculate hourly averages and repeat the week1_summary code